ABSTRACT

In this paper, we present dictionary learning methods for sparse and redundant signal representations in a high-dimensional feature space. Using the kernel method, we describe how well-known dictionary learning approaches such as the method of optimal directions and K-SVD can be made nonlinear. We analyze these constructions and demonstrate their improved performance through several experiments on classification problems. It is shown that nonlinear dictionary learning approaches can provide better discrimination compared to their linear counterparts and kernel PCA, especially when the data is corrupted by noise.

Index  Terms —  Kernel  methods,  dictionary  learning,  method  of 
optimal  directions,  K-SVD. 

1.  INTRODUCTION 

Sparse and redundant signal representations have recently drawn much interest in vision, signal and image processing [1]. This is due in part to the fact that signals and images of interest can be sparse or compressible in some dictionary. The dictionary can either be based on a mathematical model of the data or be learned directly from the data. It has been observed that learning a dictionary directly from the training data rather than using a predetermined dictionary (e.g., wavelets) usually leads to a more compact representation and hence can provide improved results in many practical image processing applications such as restoration and classification [1].

Several algorithms have been developed for the task of learning dictionaries. Two of the most well-known algorithms are the method of optimal directions (MOD) [2] and the K-SVD algorithm [3]. Given a set of examples $Y = [y_1, \dots, y_n]$, the goal of the K-SVD and MOD algorithms is to find a dictionary $D$ and a sparse matrix $X$ that minimize the following representation error

$(\hat{D}, \hat{X}) = \arg\min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0 \ \ \forall i,$

where $x_i$ denotes the $i$-th column of $X$ and the $\ell_0$ sparsity measure $\|\cdot\|_0$ counts the number of nonzero elements in the representation. Here, $\|A\|_F$ denotes the Frobenius norm. Both MOD and K-SVD are iterative methods that alternate between sparse-coding and dictionary update steps.

The representation obtained by learning a dictionary can be further enhanced by exploiting the nonlinearities present in the data [4], [5]. For instance, in [6] it is shown that if the nonlinear sparsity is properly exploited then one can accurately recover nonlinearly $K$-sparse signals from approximately $2K$ measurements, which is far fewer than the number of measurements usually required for signals that are sparse in an orthonormal basis. In this paper, using kernel methods, we develop dictionary learning algorithms that take into account the nonlinear structure of data. Our dictionary learning methods yield representations that are more compact than kernel PCA and handle nonlinearity better than their linear counterparts. Fig. 1 presents an important comparison of the representation power of kernel PCA and a learned kernel dictionary. A comparison of the mean-squared error (MSE) of an image from the USPS dataset when approximated from $m$ kernel PCA components and $m$ kernel dictionary atoms (denoted by kernel K-SVD) shows that the MSE decays much faster when a learned nonlinear dictionary is used. This example shows that the image is nonlinearly sparse and that learning a dictionary in the high-dimensional feature space can provide a better representation of the data.


Fig.  1.  Comparison  of  error  percentage  using  kernel  K-SVD  and 
kernel  PCA. 

Background and problem formulation: Let $\Phi : \mathbb{R}^N \to F$ be a nonlinear mapping from $\mathbb{R}^N$ into a higher dimensional feature space $F$. Since the feature space $F$ can be very high dimensional, kernel methods usually employ Mercer kernels to carry out the mapping implicitly. A Mercer kernel is a function $\kappa(x, y)$ that, for all data $\{y_i\}$, gives rise to a positive semidefinite matrix $K_{ij} = \kappa(y_i, y_j)$. It can be shown that using $\kappa$ instead of the dot product in the input space corresponds to mapping the data with some mapping $\Phi$ into a feature space $F$. That is, $\kappa(x, y) = \langle \Phi(x), \Phi(y) \rangle$. Some commonly used kernels include polynomial kernels $\kappa(x, y) = (\langle x, y \rangle + c)^d$ and Gaussian kernels $\kappa(x, y) = \exp(-\|x - y\|^2 / c)$, where $c$ and $d$ are the parameters. Thus, any algorithm that can be formulated in terms of dot products can be carried out in some feature space $F$ without mapping the data explicitly by substituting a chosen kernel.
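The two kernels above can be written compactly in NumPy. The following is a minimal sketch, not code from the paper: the function names, the column-wise data layout, and the parameter values are assumptions made for illustration. It also checks, for the homogeneous quadratic kernel in two dimensions, that the kernel value equals an inner product under an explicit feature map.

```python
import numpy as np

def polynomial_kernel(X, Z, c=1.0, d=2):
    """(<x, z> + c)^d for every pair of columns of X and Z."""
    return (X.T @ Z + c) ** d

def gaussian_kernel(X, Z, c=1.0):
    """exp(-||x - z||^2 / c) for every pair of columns of X and Z."""
    sq_dist = ((X ** 2).sum(axis=0)[:, None] + (Z ** 2).sum(axis=0)[None, :]
               - 2.0 * X.T @ Z)
    return np.exp(-np.maximum(sq_dist, 0.0) / c)   # clip tiny negative round-off

def phi_quadratic(x):
    """Explicit feature map of kappa(x, z) = <x, z>^2 for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 20))               # 20 signals stored as columns
K_poly = polynomial_kernel(Y, Y, c=0.0, d=2)   # K_ij = kappa(y_i, y_j)
K_gauss = gaussian_kernel(Y, Y, c=2.0)

# The kernel value equals an inner product in the feature space
lhs = K_poly[3, 7]
rhs = phi_quadratic(Y[:, 3]) @ phi_quadratic(Y[:, 7])
```

For kernels such as the Gaussian, the feature space is infinite dimensional and no finite explicit map is available, which is why the algorithms in this paper only ever touch kernel values.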

In this paper, we will use the following model for the dictionary $D$: $D = BA$, where $B$ is some predefined base dictionary and $A$ is the atom representation dictionary [7]. The base dictionary $B$ can be chosen such that it incorporates some prior knowledge about the data. This model provides adaptivity via modification of the matrix $A$. Let $\Phi(Y)$ denote the matrix whose columns are obtained by embedding the input signals $Y = [y_1, \dots, y_n]$ into some feature space using the mapping $\Phi$. That is, $\Phi(Y) = [\Phi(y_1), \dots, \Phi(y_n)]$. Furthermore, we denote the learned dictionary in the feature space as $\Phi(D)$. Since dictionary atoms lie within the subspace spanned by the input data, we can write $\Phi(D) = \Phi(Y)A$, where $A$ is the atom representation dictionary and $\Phi(Y)$ is the base dictionary.

Our goal is to find the best dictionary $\Phi(D)$ via $A$ to represent the data in the feature space $\{\Phi(y_i)\}_{i=1}^{n}$ as sparse compositions by solving the following optimization problem

$\arg\min_{A,X} \|\Phi(Y) - \Phi(Y)AX\|_F^2 \quad \text{s.t.} \quad \|x_i\|_0 \le T_0, \ \forall i. \quad (1)$

The objective function in (1) can be rewritten as in Eq. (2), which explicitly depends on the kernel matrix $K$, but not on the mapping $\Phi$:

$\|\Phi(Y) - \Phi(Y)AX\|_F^2 = \mathrm{tr}\left((I - AX)^T K(Y,Y) (I - AX)\right), \quad (2)$

where $K(Y,Y) \in \mathbb{R}^{n \times n}$ is a positive semidefinite matrix whose elements are computed from the Mercer kernel

$[K(Y,Y)]_{ij} = [\langle \Phi(Y), \Phi(Y) \rangle]_{ij} = \kappa(y_i, y_j).$


Input: A signal $z$, a kernel function $\kappa$, $A$, and a sparsity level $T_0$.

Task: Find a coefficient vector $x \in \mathbb{R}^K$ with at most $T_0$ non-zero coefficients such that $\Phi(Y)Ax$ approximates $\Phi(z)$.

Initialize: $s = 0$, $I_0 = \emptyset$, $x_0 = 0$, $z_0 = 0$

Procedure:

1. $\tau_i = (K(z,Y) - z_s^T K(Y,Y)) a_i$, $\forall i \notin I_s$

2. $i_{max} = \arg\max_i |\tau_i|$, $\forall i \notin I_s$

3. Update the index set $I_{s+1} = I_s \cup \{i_{max}\}$

4. $x_{s+1} = (A_{I_{s+1}}^T K(Y,Y) A_{I_{s+1}})^{-1} (K(z,Y) A_{I_{s+1}})^T$

5. $z_{s+1} = A_{I_{s+1}} x_{s+1}$

6. $s \leftarrow s + 1$; repeat steps 1-6 $T_0$ times

Output: Sparse vector $x \in \mathbb{R}^K$ satisfying $x(I_s(j)) = x_s(j)$ for $j = 1, \dots, |I_s|$, and zero elsewhere.

Fig. 2. The KOMP algorithm.
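The identity in Eq. (2) can be checked numerically whenever an explicit feature map is available. The sketch below is only an illustration under stated assumptions: it uses the homogeneous quadratic kernel $\kappa(x, z) = \langle x, z \rangle^2$ on 2-D inputs, whose feature map is known in closed form, together with random $A$ and $X$; the helper name `kernel_objective` is hypothetical.

```python
import numpy as np

def kernel_objective(K_YY, A, X):
    """Right-hand side of Eq. (2): tr((I - AX)^T K(Y,Y) (I - AX))."""
    R = np.eye(K_YY.shape[0]) - A @ X
    return np.trace(R.T @ K_YY @ R)

rng = np.random.default_rng(1)
n, num_atoms = 12, 4
Y = rng.standard_normal((2, n))
A = rng.standard_normal((n, num_atoms))
X = rng.standard_normal((num_atoms, n))

# Explicit feature map of kappa(x, z) = <x, z>^2 in 2-D, applied column-wise
Phi_Y = np.vstack([Y[0] ** 2, np.sqrt(2.0) * Y[0] * Y[1], Y[1] ** 2])
K_YY = (Y.T @ Y) ** 2

explicit = np.linalg.norm(Phi_Y - Phi_Y @ A @ X, "fro") ** 2   # left-hand side
implicit = kernel_objective(K_YY, A, X)                        # right-hand side
```

Both quantities agree up to floating-point error, while only the second one remains computable when the feature space is too large to represent explicitly.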

Equipped  with  the  above  notation,  in  the  following  section,  we 
present  two  algorithms  for  learning  a  dictionary  in  the  feature  space. 

2.  KERNEL  DICTIONARY  LEARNING 

Just as in the case of K-SVD and MOD, our method of learning dictionaries involves two stages: sparse coding and dictionary update. In what follows, we describe them in detail.

Sparse coding: In this stage, the matrix $A$ is assumed to be fixed, and we seek the sparse codes contained in the matrix $X$. Note that the penalty term in (1) can be rewritten as

$\|\Phi(Y) - \Phi(Y)AX\|_F^2 = \sum_{i=1}^{n} \|\Phi(y_i) - \Phi(Y)Ax_i\|_2^2.$

Hence, the problem in (1) can be reformulated as solving $n$ different problems of the following form

$\min_{x_i} \|\Phi(y_i) - \Phi(Y)Ax_i\|_2^2 \quad \text{s.t.} \quad \|x_i\|_0 \le T_0, \quad (3)$

for $i = 1, \dots, n$. The above problem can be solved by any pursuit algorithm [8, 9], with the modification that the signals now lie in a feature space of very high dimension. In the following, we show how the well-known orthogonal matching pursuit (OMP) algorithm [9, 10] can be generalized using kernels to solve (3).

Kernel Orthogonal Matching Pursuit (KOMP): Given a signal $z \in \mathbb{R}^N$ and the kernel dictionary represented via $A$, we seek a sparse combination of dictionary atoms that approximates the signal in the feature space: $\Phi(z) = \Phi(Y)z_s + r_s$. Here, $z_s \in \mathbb{R}^n$ holds the coefficients of the current estimate of $\Phi(z)$ with respect to $\Phi(Y)$, and $r_s$ is the current residual.

The pseudo-code for KOMP is given in Fig. 2. Let $I_s$ denote the set of indices of selected atoms. In the first step, the residual is projected onto the remaining dictionary atoms:

$r_s^T(\Phi(Y)a_i) = (\Phi(z) - \Phi(Y)z_s)^T(\Phi(Y)a_i)$

$= (K(z,Y) - z_s^T K(Y,Y)) a_i, \quad i \notin I_s \quad (4)$

where, with a slight abuse of notation, we denote

$K(z,Y) = [\kappa(z, y_1), \kappa(z, y_2), \dots, \kappa(z, y_n)].$

The algorithm then selects a new dictionary atom in the remaining set that gives the largest projection coefficient in Eq. (4). This selection guarantees the biggest reduction of approximation error.

Let $A_{I_s}$ indicate the set of dictionary atoms whose indices are from the set $I_s$. We want to project the signal $\Phi(z)$ onto the subspace spanned by the selected dictionary atoms $\Phi(Y)A_{I_s}$. The projection coefficients are obtained as follows:

$x_s = \left((\Phi(Y)A_{I_s})^T(\Phi(Y)A_{I_s})\right)^{-1} (\Phi(Y)A_{I_s})^T \Phi(z)$

$= (A_{I_s}^T K(Y,Y) A_{I_s})^{-1} (K(z,Y) A_{I_s})^T \quad (5)$

Note that the computation in Eq. (5) can be efficiently implemented in a recursive manner as in [9]. Once the coefficients $x_s$ are found, the approximating signal $z_s$ is updated as in step 5 of Fig. 2. The procedure is repeated until $T_0$ atoms are selected.
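The KOMP procedure of Fig. 2 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `komp`, its argument layout, and the use of a direct linear solve in place of the recursive update of [9] are assumptions. The demonstration uses a linear kernel and atoms normalized to unit norm in the feature space, so the expected behavior can be verified by hand.

```python
import numpy as np

def komp(k_zY, K_YY, A, T0):
    """Kernel OMP: sparse x such that Phi(Y) A x approximates Phi(z).

    k_zY : (n,) vector [kappa(z, y_1), ..., kappa(z, y_n)]
    K_YY : (n, n) kernel matrix of the training signals
    A    : (n, K) atom representation dictionary
    """
    n, num_atoms = A.shape
    selected = np.zeros(num_atoms, dtype=bool)   # membership in the index set I_s
    support = []                                 # selected indices, in order
    coef = np.zeros(0)
    z_s = np.zeros(n)                            # current estimate, z_0 = 0
    for _ in range(T0):
        tau = (k_zY - z_s @ K_YY) @ A            # Eq. (4) for every atom
        score = np.abs(tau)
        score[selected] = -np.inf                # only atoms outside I_s compete
        i_max = int(np.argmax(score))
        selected[i_max] = True
        support.append(i_max)
        A_I = A[:, support]
        gram = A_I.T @ K_YY @ A_I                # Gram matrix of selected atoms
        coef = np.linalg.solve(gram, A_I.T @ k_zY)   # Eq. (5)
        z_s = A_I @ coef                         # step 5 of Fig. 2
    x = np.zeros(num_atoms)
    x[support] = coef
    return x

rng = np.random.default_rng(0)
Y = rng.standard_normal((6, 10))
K_YY = Y.T @ Y                                   # linear kernel: Phi is the identity
A = rng.standard_normal((10, 5))
A /= np.sqrt(np.sum(A * (K_YY @ A), axis=0))     # unit-norm atoms in feature space
z = 3.0 * (Y @ A[:, 2])                          # a multiple of the third atom
x_one = komp(z @ Y, K_YY, A, T0=1)
x_three = komp(z @ Y, K_YY, A, T0=3)
```

Because the query is an exact multiple of one unit-norm atom, the first iteration selects that atom and recovers its coefficient; later iterations add atoms with coefficients close to zero.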


Dictionary Update: Once the sparse codes for all training data are calculated, the dictionary can be updated such that the error $\|\Phi(Y) - \Phi(Y)AX\|_F^2$ is minimized. Taking the derivative of this error with respect to $A$ and after some manipulations, we obtain the relation $A = X^T(XX^T)^{-1}$, which leads to the following update:

$A_{k+1} = X_k^T (X_k X_k^T)^{-1}.$
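The update above does not involve the kernel matrix: setting the derivative of Eq. (2) to zero gives $K(Y,Y)(I - AX)X^T = 0$, which holds whenever $(I - AX)X^T = 0$. A minimal sketch of this step follows; the function name `kernel_mod_update` is hypothetical, and the pseudo-inverse is a substitution made here (not stated in the paper) so that the update stays defined when $XX^T$ is singular.

```python
import numpy as np

def kernel_mod_update(X):
    """Kernel MOD dictionary update A = X^T (X X^T)^{-1}."""
    # pinv instead of inv is an assumption of this sketch: it keeps the update
    # defined when X X^T is singular (e.g. an atom that no signal uses).
    return X.T @ np.linalg.pinv(X @ X.T)

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 12))      # K = 4 atoms, n = 12 signals
A = kernel_mod_update(X)              # shape (n, K) = (12, 4)

# Stationarity residual (I - AX) X^T, which the update drives to zero
stationarity = (np.eye(12) - A @ X) @ X.T
```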

This way of updating the dictionary is essentially the idea behind the MOD method [2]. As discussed in [3], one of the major drawbacks of the MOD method is the high complexity of the matrix inversion during the dictionary update stage. Several other methods have been proposed that focus on reducing this complexity. One such algorithm is K-SVD [3]. Following the procedure of K-SVD, in what follows, we describe a more sophisticated way of updating the dictionary.

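Kernel K-SVD: Let $a_k$ and $x_T^j$ denote the $k$-th column of $A$ and the $j$-th row of $X$, respectively. The error $\|\Phi(Y) - \Phi(Y)AX\|_F^2$ can be rewritten as:

$\|\Phi(Y) - \Phi(Y)AX\|_F^2 = \left\| \Phi(Y) - \Phi(Y) \sum_{j=1}^{K} a_j x_T^j \right\|_F^2$

$= \left\| \Phi(Y) \left( I - \sum_{j \ne k} a_j x_T^j \right) - \Phi(Y)(a_k x_T^k) \right\|_F^2$

$= \|\Phi(Y)E_k - \Phi(Y)M_k\|_F^2 \quad (6)$

where

$E_k = \left( I - \sum_{j \ne k} a_j x_T^j \right); \quad M_k = (a_k x_T^k). \quad (7)$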

$E_k$ indicates the error between the approximated signals and the true signals when removing the $k$-th dictionary atom. $M_k$ indicates the contribution of the $k$-th dictionary atom to the estimated signals.

In this stage, we assume that only $(a_k, x_T^k)$ are variables and the rest are fixed; hence $E_k$ is constant for each $k$. Minimizing the above objective is equivalent to finding $(a_k, x_T^k)$ such that the rank-1 matrix $\Phi(Y)M_k$ best approximates $\Phi(Y)E_k$. The optimal solution can be obtained via SVD. However, there are two reasons that make a direct SVD inappropriate. First, it would yield a dense vector $x_T^k$, which increases the number of non-zeros in the representation $X$. Second, the matrix $\Phi(Y)E_k$ might have an infinitely large row dimension, which is computationally prohibitive.

In order to minimize the objective function while keeping the sparsities of all the representations fixed, we work only with a subset of columns. Note that the columns of $M_k$ associated with zero-valued elements of $x_T^k$ are all zero. These columns do not affect the minimization of the objective function. Therefore, we can shrink the matrices $E_k$ and $M_k$ by discarding these zero columns. An advantage of working with the reduced matrices is that only the non-zero coefficients in $x_T^k$ are allowed to vary and therefore the sparsities are preserved [3].

Define $\omega_k$ as the group of indices pointing to examples $\{\Phi(y_i)\}$ that use the atom $(\Phi(Y)A)_k$: $\omega_k = \{i \mid 1 \le i \le n, \ x_T^k(i) \ne 0\}$. Let $\Omega_k$ be a matrix of size $n \times |\omega_k|$, with ones on the $(\omega_k(i), i)$-th entries and zeros elsewhere. When multiplying with $\Omega_k$, all zeros within the row vector $x_T^k$ are discarded, resulting in the row vector $x_R^k$ of length $|\omega_k|$. The column-reduced matrices are obtained as $E_k^R = E_k \Omega_k$; $M_k^R = M_k \Omega_k$.
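A small worked example makes the role of $\Omega_k$ concrete. The numbers below are made up for illustration, and indices are 0-based as usual in NumPy. Multiplying by $\Omega_k$ is the same as selecting the columns listed in $\omega_k$, so in practice the matrix never needs to be formed.

```python
import numpy as np

x_row = np.array([0.0, 1.5, 0.0, -2.0, 0.7])   # a row x_T^k of X, with n = 5
omega = np.flatnonzero(x_row)                   # omega_k: signals that use atom k

Omega = np.zeros((x_row.size, omega.size))      # n x |omega_k|
Omega[omega, np.arange(omega.size)] = 1.0       # ones at (omega_k(i), i)

x_R = x_row @ Omega                             # reduced row: zeros discarded

E = np.arange(25.0).reshape(5, 5)               # stand-in for E_k
E_R = E @ Omega                                 # identical to E[:, omega]
```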

We can now modify the cost function in (6) so that its solution has the same support as the original $x_T^k$:

$\|\Phi(Y)E_k^R - \Phi(Y)M_k^R\|_F^2 = \|\Phi(Y)E_k^R - \Phi(Y)a_k x_R^k\|_F^2. \quad (8)$

Recall that $\Phi(Y)a_k x_R^k$ is a rank-1 matrix. Therefore, the optimal solution of (8) can be obtained by first decomposing $\Phi(Y)E_k^R$ into rank-1 matrices using SVD, and then equating $\Phi(Y)a_k x_R^k$ to the rank-1 matrix corresponding to the largest singular value. That is,

$\Phi(Y)E_k^R = U \Sigma V^T \quad (9)$

$\Phi(Y)a_k x_R^k = \sigma_1 u_1 v_1^T, \quad (10)$

where $u_1$ and $v_1$ are the first columns of $U$ and $V$ corresponding to the largest singular value $\sigma_1 = \Sigma(1,1)$, respectively. A valid breakdown for the assignment (10) is given in Eqs. (11) and (12). The reason for putting the multiplier $\sigma_1$ in Eq. (11) instead of in Eq. (12) will become clearer when solving for $a_k$. Basically, such an assignment guarantees that the resulting dictionary atom in the feature space is normalized to unit norm:

$x_R^k = \sigma_1 v_1^T \quad (11)$

$\Phi(Y)a_k = u_1. \quad (12)$

However, as mentioned before, we cannot perform a direct SVD on $\Phi(Y)E_k^R$ as in (9) since this matrix can have an infinitely large row dimension. A remedy for this issue comes from the fact that the SVD is closely related to the eigendecomposition of the Gram matrix, which is independent of the row dimension. It is easily seen that

$(\Phi(Y)E_k^R)^T(\Phi(Y)E_k^R) = (E_k^R)^T K(Y,Y) (E_k^R) = V \Delta V^T,$

where $\Delta = \Sigma^T \Sigma \in \mathbb{R}^{|\omega_k| \times |\omega_k|}$. This gives us $v_1$ as the first column of $V$, and $\sigma_1 = \sqrt{\Delta(1,1)}$. Hence, $x_R^k$ can be found using the relation in (11).

To solve for $a_k$, we first observe that by right-multiplying both sides of (9) by $V$ and considering only the first column, we get

$\Phi(Y)E_k^R v_1 = \sigma_1 u_1. \quad (13)$

The solution for $a_k$ is obtained by substituting Eq. (12) into Eq. (13): $\Phi(Y)E_k^R v_1 = \sigma_1 \Phi(Y)a_k$. Hence, $a_k = \sigma_1^{-1} E_k^R v_1$. One can easily verify that this updating procedure of $a_k$ results in a dictionary atom of unit norm in the feature space. The pseudo-code for the kernel K-SVD algorithm is given in Fig. 3.
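The atom update of Eqs. (7)-(13) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `kernel_ksvd_atom_update` and `objective`, the in-place update, the handling of unused or degenerate atoms, and the random test data are all choices made here. The eigendecomposition of the small Gram matrix replaces the SVD of $\Phi(Y)E_k^R$, exactly as described above.

```python
import numpy as np

def objective(K_YY, A, X):
    """tr((I - AX)^T K(Y,Y) (I - AX)), the error of Eq. (2)."""
    R = np.eye(K_YY.shape[0]) - A @ X
    return np.trace(R.T @ K_YY @ R)

def kernel_ksvd_atom_update(K_YY, A, X, k):
    """Update column k of A and row k of X in place."""
    omega = np.flatnonzero(X[k])                 # signals that use atom k
    if omega.size == 0:
        return                                   # assumption: leave unused atoms alone
    n = K_YY.shape[0]
    # E_k = I - sum_{j != k} a_j x_T^j, restricted to the columns in omega
    E_R = (np.eye(n) - A @ X + np.outer(A[:, k], X[k]))[:, omega]
    # (E_R)^T K E_R = V Delta V^T; eigh returns eigenvalues in ascending order
    delta, V = np.linalg.eigh(E_R.T @ K_YY @ E_R)
    sigma1 = np.sqrt(max(delta[-1], 0.0))
    v1 = V[:, -1]
    if sigma1 < 1e-12:
        return                                   # assumption: skip degenerate updates
    A[:, k] = E_R @ v1 / sigma1                  # a_k = sigma_1^{-1} E_k^R v_1
    X[k, omega] = sigma1 * v1                    # Eq. (11): x_R^k = sigma_1 v_1^T

rng = np.random.default_rng(3)
Y = rng.standard_normal((3, 15))
K_YY = (Y.T @ Y + 1.0) ** 2                      # polynomial kernel, c = 1, d = 2
A = rng.standard_normal((15, 4))
X = rng.standard_normal((4, 15)) * (rng.random((4, 15)) < 0.4)
X[1, 0] = 1.0                                    # make sure atom 1 is used
support_before = X[1] != 0

obj_before = objective(K_YY, A, X)
kernel_ksvd_atom_update(K_YY, A, X, k=1)
obj_after = objective(K_YY, A, X)
atom_norm_sq = A[:, 1] @ K_YY @ A[:, 1]          # ||Phi(Y) a_k||^2
```

The unit-norm property follows from $a_k^T K(Y,Y) a_k = \sigma_1^{-2} v_1^T V \Delta V^T v_1 = \sigma_1^{-2} \Delta(1,1) = 1$, and the error cannot increase because the previous $(a_k, x_R^k)$ is itself a rank-1 candidate.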


Input: A set of signals $Y$, a kernel function $\kappa$.

Task: Find a dictionary via $A$ to represent these signals as sparse decompositions in the feature space by solving Eq. (1).

Initialize: Set $T_0$ random elements of each column in $X$ to be 1. Set iteration $J = 1$.

Stage 1: Sparse coding

Use the KOMP algorithm described in Fig. 2 to obtain the sparse coefficient matrix $X$ given a fixed dictionary $A$.

Stage 2: Dictionary update

For each column $k = 1, 2, \dots, K$ in $A^{(J-1)}$, update it by

- Define the group of examples that use this atom, $\omega_k = \{i \mid 1 \le i \le n, \ x_T^k(i) \ne 0\}$

- Define the representation error matrix, $E_k$, by (7).

- Restrict $E_k$ by choosing only the columns corresponding to $\omega_k$, and obtain $E_k^R$ as $E_k^R = E_k \Omega_k$

- Apply eigendecomposition to get $(E_k^R)^T K(Y,Y) (E_k^R) = V \Delta V^T$. Choose the updated $a_k = \sigma_1^{-1} E_k^R v_1$, where $v_1$ is the first vector of $V$, corresponding to the largest eigenvalue $\sigma_1^2 = \Delta(1,1)$.

Set $J = J + 1$

Output: $A$ and $X$.

Fig.  3.  The  kernel  K-SVD  algorithm. 

3.  EXPERIMENTAL  RESULTS 

First, we present two synthetic experiments to examine the effectiveness of a learned dictionary in the feature space. The following parameters are used to learn dictionaries using both K-SVD and kernel K-SVD: dictionaries are learned with 30 atoms, $T_0 = 3$, a polynomial kernel of degree 2 is used, and the maximum number of training iterations is set to 80.

The first synthetic experiment is done with two classes of data. In each class, 1500 data samples are randomly generated from a 2-dimensional circle $\{y = [y_1, y_2] \in \mathbb{R}^2 \mid y_1^2 + y_2^2 = r^2\}$. The radius $r$ of the first circle (class 1) is half that of the second circle (class 2). The first figure in the left column of Fig. 4 shows the color-coded map of the error ratio obtained by dividing the reconstruction errors of the second class by those of the first class for all points on the $\mathbb{R}^2$ plane. Since data samples from the two classes lie roughly on the same linear subspace, which is the entire plane in $\mathbb{R}^2$, dictionaries learned using K-SVD are indistinguishable for the two classes. This is clearly seen from this figure, where the error ratios are quite random even for points lying on the circles.

In contrast, as can be seen from the first figure in the second row of Fig. 4, the error ratios corresponding to a dictionary learned in the feature space exhibit strong differences between the two classes. In particular, error ratios are very high for points lying close to the first class, and very small for points lying close to the second class. Moreover, points on the same circle have similar error ratios. This observation implies that kernel K-SVD correctly learns the nonlinear structure of the data and embeds this information into the kernel dictionary atoms.


Fig. 4. Left: Comparison of error ratio for K-SVD and kernel K-SVD (common logarithm scale). Right: Comparison between contours of linear K-SVD and kernel K-SVD for three different dictionary atoms. In both figures, the first row corresponds to K-SVD and the second row corresponds to kernel K-SVD.

In the second synthetic experiment, we learn a dictionary from 1500 data samples generated from a 2-dimensional parabola $\{y = [y_1, y_2] \in \mathbb{R}^2 \mid y_2 = y_1^2\}$. Columns 2-4 in Fig. 4 show level curves of the projection coefficients for three different dictionary atoms. The level curves are obtained as follows. First, we project every point $y \in \mathbb{R}^2$ onto the selected dictionary atom to get the projection coefficients. Then, points with the same projection coefficients are grouped together and are shown with the same color. Coefficients of the kernel K-SVD (bottom row of columns 2-4 in Fig. 4) change most dramatically along the main directions of the data's variation, while coefficients of the linear K-SVD do not. Again, this observation implies that our dictionary learning method can provide a good representation for data with nonlinear structures.

Digit Recognition: In recent years, there has been great interest in applying dictionary learning methods to classification. To this end, we apply our approach to the real-world handwritten digit classification problem. We use the USPS database, which contains ten classes of 256-dimensional handwritten digits. For each class, we randomly select 300 samples for training and 200 samples for testing. We use the following parameters for learning dictionaries: dictionaries are learned with 500 atoms, $T_0 = 5$, a polynomial kernel of degree 4 is used, and the maximum number of training iterations is set to 80.

We use the generative approach to do classification. In particular, each digit is assigned to the class whose dictionary gives the smallest reconstruction error. Let $\{A_i\}_{i=1}^{c}$ denote the learned dictionaries for $c$ classes, and let $Y_i$ denote the training samples of the $i$-th class. Given a query image $z$, we first perform KOMP on each $A_i$ to get the sparse code $x_i$. The reconstruction error is then computed as:

$r_i = \|\Phi(z) - \Phi(Y_i)A_i x_i\|^2$

$= K(z,z) - 2K(z,Y_i)A_i x_i + x_i^T A_i^T K(Y_i,Y_i) A_i x_i. \quad (14)$
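The residual in Eq. (14) needs only kernel evaluations. The sketch below illustrates this; the function names `reconstruction_error` and `classify` are hypothetical, and the check uses a linear kernel with random data (an assumption made so that the feature-space error can also be computed explicitly).

```python
import numpy as np

def reconstruction_error(k_zz, k_zY, K_YY, A, x):
    """Eq. (14): ||Phi(z) - Phi(Y) A x||^2 from kernel values only."""
    Ax = A @ x
    return k_zz - 2.0 * k_zY @ Ax + Ax @ K_YY @ Ax

def classify(errors):
    """Index of the class whose dictionary gives the smallest residual."""
    return int(np.argmin(errors))

rng = np.random.default_rng(4)
Y = rng.standard_normal((6, 10))      # training signals of one class, as columns
A = rng.standard_normal((10, 5))
x = np.array([0.0, 1.2, 0.0, -0.5, 0.0])
z = rng.standard_normal(6)

# Linear kernel, so the same error can be computed directly in input space
r_kernel = reconstruction_error(z @ z, z @ Y, Y.T @ Y, A, x)
r_explicit = np.sum((z - Y @ A @ x) ** 2)
```

In a multi-class setting one would evaluate `reconstruction_error` once per class dictionary and pass the resulting list to `classify`.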

For kernel PCA, we project the query image $z$ onto the first 500 principal components and train a linear SVM classifier using these coefficients for classification. Note that in the case of K-SVD, kernel MOD and kernel K-SVD, dictionaries are trained separately for each class, while kernel PCA uses all training samples to obtain projection coefficients. For a fair comparison, we have also obtained projection coefficients by training separate dictionaries for each class using kernel PCA.


(a)  Gaussian  noise  (b)  Missing  pixels 

Fig.  5.  Comparison  of  digit  recognition  accuracies  for  different 
methods  in  the  presence  of  Gaussian  noise  and  missing-pixel  effects. 

The first experiment presents the results for the situation where test samples are corrupted by random Gaussian noise with different standard deviations, as shown in Fig. 5(a). Fig. 5(b) shows the results obtained when pixels are randomly removed from the test images. In both experiments, kernel K-SVD and kernel MOD consistently outperform linear K-SVD and kernel PCA. As the distortion level increases, the performance difference between kernel dictionaries and linear dictionaries becomes more dramatic.

4.  DISCUSSION  AND  CONCLUSION 

We have presented two nonlinear dictionary learning algorithms that exploit the sparsity of data in a high-dimensional feature space through an appropriate choice of kernel. It is shown that kernel methods improve the separating margin between dictionaries and allow better tolerance against different types of degradations. Experimental results indicate that exploiting nonlinear sparsity via learning dictionaries in the feature domain can provide better discrimination than the traditional linear approaches and kernel PCA.